On November 27, 1895, Alfred Nobel signed his last will in Paris. When it was opened after his death, the will caused a lot of controversy, as Nobel had left much of his wealth for the establishment of a prize.
In it, Alfred Nobel dictated that his entire remaining estate should be used to endow "prizes to those who, during the preceding year, have conferred the greatest benefit to humankind".
Every year, the Nobel Prize is awarded to scientists and scholars in the categories of chemistry, literature, physics, physiology or medicine, economics, and peace.

Let's see what patterns we can find in the data of the past Nobel laureates. What can we learn about the Nobel prize and our world more generally?
Google Colab may not be running the latest version of plotly. If you're working in Google Colab, uncomment the line below, run the cell, and restart your notebook server.
Both ! and % allow you to run commands from a Jupyter notebook: ! passes the line directly to the system shell, whereas % invokes an IPython magic command.
!pip install --upgrade plotly
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
pd.options.display.float_format = '{:,.2f}'.format
df_data = pd.read_csv('nobel_prize_data.csv')
Caveats: The exact birth dates of Michael Houghton, Venkatraman Ramakrishnan, and Nadia Murad are unknown. I've substituted them with a mid-year estimate of July 2nd.
Challenge: Preliminary data exploration. What is the shape of df_data? How many rows and columns does it have?
df_data.sample(3)
| year | category | prize | motivation | prize_share | laureate_type | full_name | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 693 | 1998 | Medicine | The Nobel Prize in Physiology or Medicine 1998 | "for their discoveries concerning nitric oxide... | 1/3 | Individual | Robert F. Furchgott | 1916-06-04 | Charleston, SC | United States of America | United States of America | Male | SUNY Health Science Center | New York, NY | United States of America | USA |
| 501 | 1979 | Peace | The Nobel Peace Prize 1979 | NaN | 1/1 | Organization | Mother Teresa | 1910-08-26 | Uskup (Skopje) | Ottoman Empire (Republic of Macedonia) | Republic of Macedonia | Female | NaN | NaN | NaN | MKD |
| 828 | 2009 | Physics | The Nobel Prize in Physics 2009 | "for the invention of an imaging semiconductor... | 1/4 | Individual | Willard S. Boyle | 1924-08-19 | Amherst, NS | Canada | Canada | Male | Bell Laboratories | Murray Hill, NJ | United States of America | CAN |
df_data.shape
(962, 16)
--> a fairly large dataset: 962 rows and 16 columns
df_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 962 entries, 0 to 961
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   year                   962 non-null    int64
 1   category               962 non-null    object
 2   prize                  962 non-null    object
 3   motivation             874 non-null    object
 4   prize_share            962 non-null    object
 5   laureate_type          962 non-null    object
 6   full_name              962 non-null    object
 7   birth_date             934 non-null    object
 8   birth_city             931 non-null    object
 9   birth_country          934 non-null    object
 10  birth_country_current  934 non-null    object
 11  sex                    934 non-null    object
 12  organization_name      707 non-null    object
 13  organization_city      707 non-null    object
 14  organization_country   708 non-null    object
 15  ISO                    934 non-null    object
dtypes: int64(1), object(15)
memory usage: 120.4+ KB
We could potentially convert "year" and "birth_date" into datetime objects.
print(f"The first Nobel prize was awarded in {df_data.year.min()}")
The first Nobel prize was awarded in 1901
print(f"The latest year in our dataset is {df_data.year.max()}")
The latest year in our dataset is 2020
Challenge: Are there any duplicate values or NaN values in the dataset?
df_data.duplicated().any()
False
df_data.isna().any().any()
True
df_data.isna().any().loc[df_data.isna().any()].to_frame().transpose()
| motivation | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | True | True | True | True | True | True | True | True | True |
These are the columns that host NaN values.
df_data.isna().sum().loc[df_data.isna().sum() != 0].to_frame().transpose()
| motivation | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 88 | 28 | 31 | 28 | 28 | 28 | 255 | 255 | 254 | 28 |
We can see above the number of NaN values for each column that contains some.
A lot of laureates are not associated with an organization (organization name, city, country).
The most important columns (year, category, prize, full_name) don't have NaN values, so the dataset is in good shape overall.
Let us filter on the NaN values in the birth date columns:
col_subset = ['year','category', 'laureate_type',
'birth_date','full_name', 'organization_name']
df_data.loc[df_data.birth_date.isna()][col_subset] # we can reuse a subset like this!
| year | category | laureate_type | birth_date | full_name | organization_name | |
|---|---|---|---|---|---|---|
| 24 | 1904 | Peace | Organization | NaN | Institut de droit international (Institute of ... | NaN |
| 60 | 1910 | Peace | Organization | NaN | Bureau international permanent de la Paix (Per... | NaN |
| 89 | 1917 | Peace | Organization | NaN | Comité international de la Croix Rouge (Intern... | NaN |
| 200 | 1938 | Peace | Organization | NaN | Office international Nansen pour les Réfugiés ... | NaN |
| 215 | 1944 | Peace | Organization | NaN | Comité international de la Croix Rouge (Intern... | NaN |
| 237 | 1947 | Peace | Organization | NaN | American Friends Service Committee (The Quakers) | NaN |
| 238 | 1947 | Peace | Organization | NaN | Friends Service Council (The Quakers) | NaN |
| 283 | 1954 | Peace | Organization | NaN | Office of the United Nations High Commissioner... | NaN |
| 348 | 1963 | Peace | Organization | NaN | Comité international de la Croix Rouge (Intern... | NaN |
| 349 | 1963 | Peace | Organization | NaN | Ligue des Sociétés de la Croix-Rouge (League o... | NaN |
| 366 | 1965 | Peace | Organization | NaN | United Nations Children's Fund (UNICEF) | NaN |
| 399 | 1969 | Peace | Organization | NaN | International Labour Organization (I.L.O.) | NaN |
| 479 | 1977 | Peace | Organization | NaN | Amnesty International | NaN |
| 523 | 1981 | Peace | Organization | NaN | Office of the United Nations High Commissioner... | NaN |
| 558 | 1985 | Peace | Organization | NaN | International Physicians for the Prevention of... | NaN |
| 588 | 1988 | Peace | Organization | NaN | United Nations Peacekeeping Forces | NaN |
| 659 | 1995 | Peace | Organization | NaN | Pugwash Conferences on Science and World Affairs | NaN |
| 682 | 1997 | Peace | Organization | NaN | International Campaign to Ban Landmines (ICBL) | NaN |
| 703 | 1999 | Peace | Organization | NaN | Médecins Sans Frontières | NaN |
| 730 | 2001 | Peace | Organization | NaN | United Nations (U.N.) | NaN |
| 778 | 2005 | Peace | Organization | NaN | International Atomic Energy Agency (IAEA) | NaN |
| 788 | 2006 | Peace | Organization | NaN | Grameen Bank | NaN |
| 801 | 2007 | Peace | Organization | NaN | Intergovernmental Panel on Climate Change (IPCC) | NaN |
| 860 | 2012 | Peace | Organization | NaN | European Union (EU) | NaN |
| 873 | 2013 | Peace | Organization | NaN | Organisation for the Prohibition of Chemical W... | NaN |
| 897 | 2015 | Peace | Organization | NaN | National Dialogue Quartet | NaN |
| 919 | 2017 | Peace | Organization | NaN | International Campaign to Abolish Nuclear Weap... | NaN |
| 958 | 2020 | Peace | Organization | NaN | World Food Programme (WFP) | NaN |
We see that we get back a bunch of organisations, like the UN or the Red Cross. That makes sense. We also see that since the organisation's name is in the full_name column, the organisation_name column contains NaN.
In addition, let's filter on the NaN values in the organization_name:
df_data.loc[df_data.organization_name.isna()][col_subset]
| year | category | laureate_type | birth_date | full_name | organization_name | |
|---|---|---|---|---|---|---|
| 1 | 1901 | Literature | Individual | 1839-03-16 | Sully Prudhomme | NaN |
| 3 | 1901 | Peace | Individual | 1822-05-20 | Frédéric Passy | NaN |
| 4 | 1901 | Peace | Individual | 1828-05-08 | Jean Henry Dunant | NaN |
| 7 | 1902 | Literature | Individual | 1817-11-30 | Christian Matthias Theodor Mommsen | NaN |
| 9 | 1902 | Peace | Individual | 1843-05-21 | Charles Albert Gobat | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| 932 | 2018 | Peace | Individual | 1993-07-02 | Nadia Murad | NaN |
| 942 | 2019 | Literature | Individual | 1942-12-06 | Peter Handke | NaN |
| 946 | 2019 | Peace | Individual | 1976-08-15 | Abiy Ahmed Ali | NaN |
| 954 | 2020 | Literature | Individual | 1943-04-22 | Louise Glück | NaN |
| 958 | 2020 | Peace | Organization | NaN | World Food Programme (WFP) | NaN |
255 rows × 6 columns
We also see that many prizes went to people who were not affiliated with a university or research institute. This includes many of the Literature and Peace prize winners.
Challenge:
Challenge: Convert the birth_date column to pandas Datetime objects. Add a column called share_pct which has the laureates' share as a percentage in the form of a floating-point number.
df_data.birth_date = pd.to_datetime(df_data.birth_date)
The .str.split() method returns a Series whose values are lists of two strings, such as ['1', '1']. Then we use .map() or .apply() to get hold of the first or second element. See StackOverflow post
numerator = df_data.prize_share.str.split('/').map(lambda x: x[0]).astype('int64')
denominator = df_data.prize_share.str.split('/').map(lambda x: x[1]).astype('int64')
df_data.insert(loc=5, column='share_pct',
value=numerator/denominator) # loc to specify where to insert the column
Alternatively, I could have used the expand parameter of the .split() method, e.g. separated_values = df_data.prize_share.str.split('/', expand=True)
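As a small sketch with a toy Series (not the actual dataset), the expand=True approach looks like this:

```python
import pandas as pd

# Toy stand-in for the prize_share column
shares = pd.Series(['1/1', '1/2', '1/4'])

# expand=True returns a DataFrame with one column per split part
parts = shares.str.split('/', expand=True).astype('int64')
share_pct = parts[0] / parts[1]
print(share_pct.tolist())  # [1.0, 0.5, 0.25]
```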
To select a specific part of a DataFrame: df.iloc[row_start:row_end, col_start:col_end]
df_data.iloc[2:4, :6]
| year | category | prize | motivation | prize_share | share_pct | |
|---|---|---|---|---|---|---|
| 2 | 1901 | Medicine | The Nobel Prize in Physiology or Medicine 1901 | "for his work on serum therapy, especially its... | 1/1 | 1.00 |
| 3 | 1901 | Peace | The Nobel Peace Prize 1901 | NaN | 1/2 | 0.50 |
Sidenote: The .T attribute of a df directly returns the transposed version, e.g. df.T
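A quick toy example of the .T attribute:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# .T returns the transposed DataFrame: rows become columns and vice versa
print(df.shape)    # (3, 2)
print(df.T.shape)  # (2, 3)
```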
Challenge: Create a donut chart using plotly which shows how many prizes went to men compared to how many prizes went to women. What percentage of all the prizes went to women?
Note: In general with pandas (and NumPy), we use the bitwise NOT operator ~ instead of not (whose behaviour can't be overridden by custom types), e.g.
df_people = df_data.loc[~ df_data.sex.isna()] # to exclude the organizations
prize_per_gender = df_data.groupby('sex', dropna=False).agg({'prize': 'count'})
prize_per_gender
| prize | |
|---|---|
| sex | |
| Female | 58 |
| Male | 876 |
| NaN | 28 |
More elegant solution, from the course: df_data.sex.value_counts(dropna=False).to_frame()
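To illustrate value_counts(dropna=False) on a toy Series (the values here are made up for the example):

```python
import pandas as pd
import numpy as np

sex = pd.Series(['Male', 'Female', 'Male', np.nan])

# dropna=False keeps NaN as its own row in the counts,
# which is how the organizations show up in our data
counts = sex.value_counts(dropna=False)
print(counts['Male'])  # 2
print(len(counts))     # 3  (Male, Female, NaN)
```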
fig1 = px.pie(prize_per_gender, values='prize', names=prize_per_gender.index,
title='Nobel Prize by gender', hole=0.5, height=500)
fig1.update_traces(textfont_size=15)
fig1.show()
Challenge: Who were the first 3 women to win a Nobel Prize? What did they win the prize for? What is their birth_country? Were they part of an organisation?
three_first_females = df_data.loc[df_data.sex=='Female']\
.sort_values('year').head(3)
three_first_females.full_name.values
array(['Marie Curie, née Sklodowska',
'Baroness Bertha Sophie Felicita von Suttner, née Countess Kinsky von Chinic und Tettau',
'Selma Ottilia Lovisa Lagerlöf'], dtype=object)
three_first_females.drop(['organization_city', 'organization_country',\
'motivation', 'laureate_type', 'prize'], axis='columns')
| year | category | prize_share | share_pct | full_name | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | ISO | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18 | 1903 | Physics | 1/4 | 0.25 | Marie Curie, née Sklodowska | 1867-11-07 | Warsaw | Russian Empire (Poland) | Poland | Female | NaN | POL |
| 29 | 1905 | Peace | 1/1 | 1.00 | Baroness Bertha Sophie Felicita von Suttner, n... | 1843-06-09 | Prague | Austrian Empire (Czech Republic) | Czech Republic | Female | NaN | CZE |
| 51 | 1909 | Literature | 1/1 | 1.00 | Selma Ottilia Lovisa Lagerlöf | 1858-11-20 | Mårbacka | Sweden | Sweden | Female | NaN | SWE |
Challenge: Did some people get a Nobel Prize more than once? If so, who were they?
df_data.duplicated(subset=['full_name']).any()
True
df_data.full_name.loc[df_data.duplicated(subset=['full_name'],
keep=False)].to_frame()
| full_name | |
|---|---|
| 18 | Marie Curie, née Sklodowska |
| 62 | Marie Curie, née Sklodowska |
| 89 | Comité international de la Croix Rouge (Intern... |
| 215 | Comité international de la Croix Rouge (Intern... |
| 278 | Linus Carl Pauling |
| 283 | Office of the United Nations High Commissioner... |
| 297 | John Bardeen |
| 306 | Frederick Sanger |
| 340 | Linus Carl Pauling |
| 348 | Comité international de la Croix Rouge (Intern... |
| 424 | John Bardeen |
| 505 | Frederick Sanger |
| 523 | Office of the United Nations High Commissioner... |
Challenge: In how many categories are prizes awarded? Create a plotly bar chart with the number of prizes awarded in each category. Use the colour scale Aggrnyl to colour the chart, but don't show a color axis.
df_data.category.nunique()
6
category_count = df_data.category.value_counts()
category_count.index
Index(['Medicine', 'Physics', 'Chemistry', 'Peace', 'Literature', 'Economics'], dtype='object')
fig2 = px.bar(x=category_count.index, y=category_count.values,
height=500, title='Nobel Prizes per Category',
labels={'x': 'Category', 'y': 'Count'}, color_continuous_scale='Aggrnyl',
color=category_count.values)
fig2.update_layout(coloraxis_showscale=False)
fig2.show()
Challenge: When was the first prize in the field of Economics awarded, and who did it go to?
df_data.loc[df_data.category == 'Economics']\
.sort_values('year').iloc[0].to_frame().T
| year | category | prize | motivation | prize_share | share_pct | laureate_type | full_name | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 393 | 1969 | Economics | The Sveriges Riksbank Prize in Economic Scienc... | "for having developed and applied dynamic mode... | 1/2 | 0.50 | Individual | Jan Tinbergen | 1903-04-12 | the Hague | Netherlands | Netherlands | Male | The Netherlands School of Economics | Rotterdam | Netherlands | NLD |
Challenge: Create a plotly bar chart that shows the split between men and women by category.
category_and_gender = df_data.groupby(['category', 'sex'],
as_index=False, dropna=False).agg({'prize': 'count'})
category_and_gender.rename(columns={'prize': 'Count', 'sex': 'Gender',\
'category': 'Category'}, inplace=True)
category_and_gender.sort_values('Gender', ascending=False,
inplace=True) # just to have the stacked columns with male first
category_and_gender.fillna(value='Organization', inplace=True)
category_and_gender
| Category | Gender | Count | |
|---|---|---|---|
| 1 | Chemistry | Male | 179 |
| 3 | Economics | Male | 84 |
| 5 | Literature | Male | 101 |
| 7 | Medicine | Male | 210 |
| 9 | Peace | Male | 90 |
| 12 | Physics | Male | 212 |
| 0 | Chemistry | Female | 7 |
| 2 | Economics | Female | 2 |
| 4 | Literature | Female | 16 |
| 6 | Medicine | Female | 12 |
| 8 | Peace | Female | 17 |
| 11 | Physics | Female | 4 |
| 10 | Peace | Organization | 28 |
# fig3 = px.bar(category_and_gender, x='category', y='count', color='sex')
fig3 = px.bar(category_and_gender, x='Category', y='Count', color='Gender',
height=500, title='Nobel Prizes by Category and Gender')
fig3.update_xaxes(categoryorder='total descending')
fig3.show()
Women have won 16 Nobel Prizes in Literature, but only 4 in Physics.
Now let's look at how things have changed over time. This will give us a chance to review what we learnt about creating charts with two y-axes in Matplotlib and generating arrays with NumPy.
Challenge: Are more prizes awarded recently than when the prize was first created? Show the trend in awards visually.

dogerblue while the rolling average is coloured in crimson. 
prizes_per_year = df_data.groupby('year').agg({'prize': 'count'})
prizes_per_year
| prize | |
|---|---|
| year | |
| 1901 | 6 |
| 1902 | 7 |
| 1903 | 7 |
| 1904 | 6 |
| 1905 | 5 |
| ... | ... |
| 2016 | 11 |
| 2017 | 12 |
| 2018 | 13 |
| 2019 | 14 |
| 2020 | 12 |
117 rows × 1 columns
prizes_per_year_rolling = prizes_per_year.rolling(window=5).mean()
prizes_per_year_rolling
| prize | |
|---|---|
| year | |
| 1901 | NaN |
| 1902 | NaN |
| 1903 | NaN |
| 1904 | NaN |
| 1905 | 6.20 |
| ... | ... |
| 2016 | 11.60 |
| 2017 | 12.00 |
| 2018 | 12.00 |
| 2019 | 12.20 |
| 2020 | 12.40 |
117 rows × 1 columns
Note: We can explicitly set where we want the tick marks with plt.xticks and np.arange() -> see StackOverflow post
x_ticks = np.arange(1900, 2021, 5)
x_ticks
array([1900, 1905, 1910, 1915, 1920, 1925, 1930, 1935, 1940, 1945, 1950,
1955, 1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005,
2010, 2015, 2020])
plt.figure(figsize=(16,8), dpi=200)
plt.xticks(ticks=x_ticks, fontsize=14, rotation=45)
plt.yticks(fontsize=14)
plt.title("Number of Nobel Prizes per Year", fontsize=18)
plt.xlabel('Year', fontsize=16)
plt.ylabel('Number of Prizes', fontsize=16)
plt.xlim(1900, 2020)
ax1 = plt.gca() # Get Current Axes
ax2 = plt.gca()
ax1.scatter(prizes_per_year.index, prizes_per_year.prize, c='dodgerblue',
alpha=0.6, s=100)
ax2.plot(prizes_per_year.index, prizes_per_year_rolling.prize, c='crimson',
linewidth=3)
ax2.grid(color='grey', linestyle='--')
plt.show()
Challenge: Investigate if more prizes are shared than before.
prize_share = df_data.groupby('year').agg({'share_pct': 'mean'}).sort_values('year')
prize_share
| share_pct | |
|---|---|
| year | |
| 1901 | 0.83 |
| 1902 | 0.71 |
| 1903 | 0.71 |
| 1904 | 0.83 |
| 1905 | 1.00 |
| ... | ... |
| 2016 | 0.55 |
| 2017 | 0.50 |
| 2018 | 0.46 |
| 2019 | 0.43 |
| 2020 | 0.50 |
117 rows × 1 columns
prize_share_rolling = prize_share.rolling(window=5).mean()
plt.figure(figsize=(16,8), dpi=200)
plt.xticks(ticks=x_ticks, fontsize=14, rotation=45)
plt.yticks(fontsize=14)
plt.title("Number of Nobel Prizes per Year", fontsize=18)
plt.xlabel('Year', fontsize=16)
plt.ylabel('Number of Prizes', fontsize=16, color='crimson')
plt.xlim(1900, 2020)
ax1 = plt.gca() # Get Current Axes
ax2 = plt.gca()
ax3 = plt.twinx()
ax1.scatter(prizes_per_year.index, prizes_per_year.prize, c='dodgerblue',
alpha=0.6, s=100)
ax2.plot(prizes_per_year.index, prizes_per_year_rolling, c='crimson',
linewidth=3)
ax2.grid(color='grey', linestyle='--')
# Third positional argument: fmt = '[marker][line][color]'
ax3.plot(prizes_per_year.index, prize_share_rolling.share_pct, '--g', # dashed line green
linewidth=3)
ax3.set_ylim(0, 1)
ax3.set_ylabel('Prize Share', fontsize=16, color='green')
plt.setp(ax3.get_yticklabels(), fontsize=14)
ax3.invert_yaxis()
plt.show()
What do we see on the chart? Well, there is clearly an upward trend in the number of prizes being given out as more and more prizes are shared. Also, more prizes have been awarded from 1969 onwards because of the addition of the economics category. We also see that very few prizes were awarded during the First and Second World Wars. Note that instead of a zero entry for those years, we see the effect of the wars as missing blue dots.
Challenge:
Create a DataFrame called top20_countries that has two columns: the country and the number of prizes. The prize column should contain the total number of prizes won.
Is it best to use birth_country, birth_country_current or organization_country? Which column is the least problematic? Then use plotly to create a horizontal bar chart showing the number of prizes won by each country. Here's what you're after: Link
What is the ranking for the top 20 countries in terms of the number of prizes?
Looking at our DataFrame there are actually 3 different columns to choose from for creating this ranking: birth_country, birth_country_current or organization_country. However, they each have certain problems and limitations.
If you look at the entries in birth_country, you'll see that some countries no longer exist! These include the Soviet Union and Czechoslovakia, for example. Hence, using birth_country_current is better, since it holds the current name of the country containing the city where the laureate was born. Note, however, that this does not determine the laureates' nationality, since some globetrotting folks gave birth to their future Nobel laureate children while abroad. Also, people's nationalities can change as they emigrate and acquire a different citizenship, or get married and change citizenship. What this boils down to is that we have to be clear about the assumptions we make in the upcoming analysis.
top20_countries = df_data.groupby('birth_country_current', as_index=False)\
.agg({'prize': 'count'}).sort_values('prize', ascending=False).head(20)
top20_countries
| birth_country_current | prize | |
|---|---|---|
| 74 | United States of America | 281 |
| 73 | United Kingdom | 105 |
| 26 | Germany | 84 |
| 25 | France | 57 |
| 67 | Sweden | 29 |
| 40 | Japan | 27 |
| 57 | Poland | 27 |
| 61 | Russia | 26 |
| 11 | Canada | 20 |
| 68 | Switzerland | 19 |
| 39 | Italy | 19 |
| 3 | Austria | 18 |
| 51 | Netherlands | 18 |
| 20 | Denmark | 12 |
| 13 | China | 12 |
| 54 | Norway | 12 |
| 2 | Australia | 10 |
| 7 | Belgium | 9 |
| 33 | India | 9 |
| 65 | South Africa | 9 |
fig4 = px.bar(top20_countries, x='prize', y='birth_country_current', color='prize',
orientation='h', height=500, title='Top 20 Countries by Number of Nobel Prizes',
color_continuous_scale='Viridis')
fig4.update_layout(xaxis_title='Number of Prizes', yaxis_title='Country',
coloraxis_showscale=False)
fig4.update_yaxes(categoryorder='total ascending')
fig4.show()
The United States has a massive number of prizes by this measure. The UK and Germany are in second and third place respectively.
Challenge: Create a choropleth map to show the number of prizes won by each country. Use the colour scale matter on this map. Hint: You'll need to use a 3 letter country code for each country.
prizes_per_ISO = df_data.groupby('ISO', as_index=False)\
.agg({'prize': 'count'}).sort_values('prize', ascending=False)
prizes_per_ISO
| ISO | prize | |
|---|---|---|
| 73 | USA | 281 |
| 28 | GBR | 105 |
| 19 | DEU | 84 |
| 27 | FRA | 57 |
| 68 | SWE | 29 |
| ... | ... | ... |
| 50 | LVA | 1 |
| 51 | MAR | 1 |
| 52 | MDG | 1 |
| 54 | MKD | 1 |
| 78 | ZWE | 1 |
79 rows × 2 columns
fig5 = px.choropleth(prizes_per_ISO, locations='ISO', color='prize',
height=700, color_continuous_scale='matter',
labels={'ISO': 'Country', 'prize': "Nb of Nobel Prizes"})
# fig4.update_layout(coloraxis_showscale=False)
fig5.show()
Challenge: See if you can divide up the plotly bar chart you created above to show which categories made up the total number of prizes. Here's what you're aiming for.
The hard part is preparing the data for this chart!
Hint: Take a two-step approach. The first step is grouping the data by country and category. Then you can create a DataFrame that looks something like this:

Problem: If we group by country and category first, we can't easily select the top 20 countries anymore. So we first need to filter the original df down to the top 20 countries. Then we can group by.
The .isin() method lets us check whether values are contained in another Series.
top20_countries_categ = df_data.loc[df_data.birth_country_current.isin(top20_countries.birth_country_current)]\
.groupby(['birth_country_current', 'category'], as_index=False).agg({'prize': 'count'})\
.sort_values('prize', ascending=False)
top20_countries_categ
| birth_country_current | category | prize | |
|---|---|---|---|
| 106 | United States of America | Medicine | 78 |
| 108 | United States of America | Physics | 70 |
| 103 | United States of America | Chemistry | 55 |
| 104 | United States of America | Economics | 49 |
| 100 | United Kingdom | Medicine | 28 |
| ... | ... | ... | ... |
| 18 | Canada | Peace | 1 |
| 62 | Netherlands | Peace | 1 |
| 28 | Denmark | Peace | 1 |
| 71 | Poland | Economics | 1 |
| 0 | Australia | Chemistry | 1 |
109 rows × 3 columns
fig6 = px.bar(top20_countries_categ, x='prize', y='birth_country_current', color='category',
orientation='h', height=500, title='Top 20 Countries by Number of Nobel Prizes and Category',
labels={'prize': 'Number of Prizes', 'birth_country_current': 'Country', 'category': 'Category'})
fig6.update_yaxes(categoryorder='total ascending')
fig6.show()
Splitting the country bar chart by category gives us a very granular look at the data and lets us answer a whole bunch of questions. For example, we see that the US has won an incredible proportion of the prizes in the field of Economics. In comparison, Japan and Germany have won very few economics prizes, or none at all. Also, the US has more prizes in physics or medicine alone than all of France's prizes combined. On the chart, we also see that Germany won more prizes in physics than the UK, and that France has won more prizes in peace and literature than Germany, even though Germany has been awarded a higher total number of prizes than France.
Challenge: Create a chart showing the cumulative number of prizes won by each country over time. Use the birth_country_current of the winner to calculate this.
# making sure the dataset is sorted by year and resetting index
df_data_sorted = df_data.sort_values('year').reset_index(drop=True)
cumul_prize_country_series = df_data_sorted.groupby('birth_country_current').cumcount()
# N.B. using .copy() below to make it a separate copy, and avoid warning when creating new column
df_cumul_country = df_data_sorted[['year', 'birth_country_current']].copy()
df_cumul_country['Cumulated Number of Nobel Prizes'] = cumul_prize_country_series
# The cumul. count starts with 0, whereas first occurrence should be 1 --> adding 1 to all
df_cumul_country['Cumulated Number of Nobel Prizes'] = df_cumul_country['Cumulated Number of Nobel Prizes'] + 1
df_cumul_country.rename(columns={'year': 'Year', 'birth_country_current': 'Country'}, inplace=True)
df_cumul_country.sort_values(['Country', 'Year'], inplace=True) # sorting by country then by year
df_cumul_country = df_cumul_country.dropna().reset_index(drop=True)
df_cumul_country
| Year | Country | Cumulated Number of Nobel Prizes | |
|---|---|---|---|
| 0 | 1957 | Algeria | 1.00 |
| 1 | 1997 | Algeria | 2.00 |
| 2 | 1936 | Argentina | 1.00 |
| 3 | 1947 | Argentina | 2.00 |
| 4 | 1980 | Argentina | 3.00 |
| ... | ... | ... | ... |
| 929 | 2020 | United States of America | 281.00 |
| 930 | 1980 | Venezuela | 1.00 |
| 931 | 1973 | Vietnam | 1.00 |
| 932 | 2011 | Yemen | 1.00 |
| 933 | 1960 | Zimbabwe | 1.00 |
934 rows × 3 columns
Note on copying DataFrames: when we create new variables containing a Series or DataFrame, they may still point to the original data, so if the original data changes, the change is reflected in all DataFrames and Series derived from it. At least, that's how I understood it.
If we use .copy() though, it creates an entirely separate object, detached from the original.
This is one way to avoid the "SettingWithCopyWarning" when creating a new column. Other ways would probably include adding the column to the original df instead of a copy, or using the .insert() method.
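A minimal sketch of the difference, using a made-up DataFrame:

```python
import pandas as pd

original = pd.DataFrame({'year': [1901, 1902]})

# A plain slice may share memory with the original;
# .copy() guarantees a fully independent DataFrame
independent = original[['year']].copy()
independent['decade'] = independent['year'] // 10 * 10

# Adding a column to the copy leaves the original untouched
print('decade' in original.columns)  # False
```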
Remarks on the created DataFrame: contrary to the course solution, with this method, and because of the way the .cumcount() Series is created, we keep the exact same number of rows as the original df, including the rows where the Country is NaN (organizations instead of people).
Also, if a country wins more than one prize in a year, we get one entry per prize, which can make the graph tricky when sorting by country, because there are two or more points for the same year. In the course solution, there is only one entry per country per year, i.e. the cumulative count/sum jumps directly to the higher total.
This explains why, even after we drop the NaN values in our df, we still have many more rows than the solution.
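The cumcount-based approach can be sketched on toy data (the country names here are made up):

```python
import pandas as pd

df = pd.DataFrame({'country': ['A', 'A', 'B', 'A'],
                   'year':    [1901, 1905, 1905, 1905]})

# cumcount() gives a 0-based running count of each row within its group,
# so every prize row gets its own cumulative total (hence the +1)
df['nth_prize'] = df.groupby('country').cumcount() + 1
print(df['nth_prize'].tolist())  # [1, 2, 1, 3]
```

Note that country A gets two separate rows for 1905, one per prize, which is exactly the duplicated-year quirk described above.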
Note on sort_values(): we can actually sort by more than one column, e.g. sort by country, then by year.
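For example, with toy data:

```python
import pandas as pd

df = pd.DataFrame({'country': ['B', 'A', 'A'],
                   'year':    [1990, 1960, 1950]})

# Sort by country first, then by year within each country
df_sorted = df.sort_values(['country', 'year'])
print(df_sorted['year'].tolist())  # [1950, 1960, 1990]
```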
Course solution: First we count the number of prizes per country per year. It's important that we keep the new indices for the next step (so we are not using as_index=False this time).
prize_by_year = df_data.groupby(by=['birth_country_current', 'year']).agg({'prize': 'count'})
prize_by_year
| prize | ||
|---|---|---|
| birth_country_current | year | |
| Algeria | 1957 | 1 |
| 1997 | 1 | |
| Argentina | 1936 | 1 |
| 1947 | 1 | |
| 1980 | 1 | |
| ... | ... | ... |
| United States of America | 2020 | 7 |
| Venezuela | 1980 | 1 |
| Vietnam | 1973 | 1 |
| Yemen | 2011 | 1 |
| Zimbabwe | 1960 | 1 |
627 rows × 1 columns
Then we can use the cumulative sum, by country, instead of the cumulative count. In order not to lose the country column, we keep it as an index and have to specify that we group by an index level.
cumulative_prizes = prize_by_year.groupby(level=[0]).cumsum()
cumulative_prizes
| prize | ||
|---|---|---|
| birth_country_current | year | |
| Algeria | 1957 | 1 |
| 1997 | 2 | |
| Argentina | 1936 | 1 |
| 1947 | 2 | |
| 1980 | 3 | |
| ... | ... | ... |
| United States of America | 2020 | 281 |
| Venezuela | 1980 | 1 |
| Vietnam | 1973 | 1 |
| Yemen | 2011 | 1 |
| Zimbabwe | 1960 | 1 |
627 rows × 1 columns
Only then can we reset the index, for easier use in the graph, since plotly doesn't know how to deal with a MultiIndex.
cumulative_prizes.reset_index(inplace=True)
cumulative_prizes
| birth_country_current | year | prize | |
|---|---|---|---|
| 0 | Algeria | 1957 | 1 |
| 1 | Algeria | 1997 | 2 |
| 2 | Argentina | 1936 | 1 |
| 3 | Argentina | 1947 | 2 |
| 4 | Argentina | 1980 | 3 |
| ... | ... | ... | ... |
| 622 | United States of America | 2020 | 281 |
| 623 | Venezuela | 1980 | 1 |
| 624 | Vietnam | 1973 | 1 |
| 625 | Yemen | 2011 | 1 |
| 626 | Zimbabwe | 1960 | 1 |
627 rows × 3 columns
fig7 = px.line(df_cumul_country, x='Year', y='Cumulated Number of Nobel Prizes',
color='Country', height=500)
fig7.show()
fig8 = px.line(cumulative_prizes, x='year', y='prize',
color='birth_country_current', height=500,
labels={'year': 'Year', 'prize': 'Cumulated Number of Prizes',
'birth_country_current': 'Country'})
fig8.show()
What we see is that the United States really started to take off after the Second World War which decimated Europe. Prior to that, the Nobel prize was pretty much a European affair. Very few laureates were chosen from other parts of the world. This has changed dramatically in the last 40 years or so. There are many more countries represented today than in the early days. Interestingly we also see that the UK and Germany traded places in the 70s and 90s on the total number of prizes won. Sweden being 5th place pretty consistently over many decades is quite interesting too. Perhaps this reflects a little bit of home bias?
Many Nobel laureates are affiliated with a university, a laboratory, or a research organisation (apart from Literature and Peace prize winners as we've seen). But the world is a big place. Which research institutions had the most Nobel laureates working there at the time of making the discovery?
Challenge: Create a bar chart showing the organisations affiliated with the Nobel laureates. It should look something like this.
# Two homemade approaches that just return the unique grouping keys one can expect when using groupby
# Both return the exact same result
orgs = df_data[['organization_country', 'organization_city', 'organization_name']].dropna().drop_duplicates()\
.reset_index(drop=True).sort_values(['organization_country', 'organization_city', 'organization_name'])
df_data.groupby(['organization_country', 'organization_city', 'organization_name'], as_index=False)\
.agg({'prize': 'count'}).drop('prize', axis='columns')
| organization_country | organization_city | organization_name | |
|---|---|---|---|
| 0 | Argentina | Buenos Aires | Institute for Biochemical Research |
| 1 | Argentina | Buenos Aires | Instituto de Biologia y Medicina Experimental ... |
| 2 | Australia | Canberra | Australian National University |
| 3 | Australia | Melbourne | Walter and Eliza Hall Institute for Medical Re... |
| 4 | Australia | Nedlands | NHMRC Helicobacter pylori Research Laboratory,... |
| ... | ... | ... | ... |
| 286 | United States of America | West Lafayette, IN | Purdue University |
| 287 | United States of America | Wilmington, DE | Du Pont |
| 288 | United States of America | Woods Hole, MA | Marine Biological Laboratory (MBL) |
| 289 | United States of America | Worcester, MA | University of Massachusetts Medical School |
| 290 | United States of America | Yorktown Heights, NY | IBM Thomas J. Watson Research Center |
291 rows × 3 columns
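The claim that both approaches return the same result can be sanity-checked on a tiny made-up frame (the column names match the dataset; the rows do not):

```python
import pandas as pd

# Toy check that drop_duplicates() and groupby() yield the same
# unique (country, city, name) combinations
df = pd.DataFrame({'organization_country': ['US', 'US', 'FR'],
                   'organization_city': ['Boston', 'Boston', 'Paris'],
                   'organization_name': ['MIT', 'MIT', 'Pasteur'],
                   'prize': [1, 1, 1]})
keys = ['organization_country', 'organization_city', 'organization_name']
a = df[keys].drop_duplicates().sort_values(keys).reset_index(drop=True)
b = df.groupby(keys, as_index=False).agg({'prize': 'count'}).drop('prize', axis='columns')
print(a.equals(b))  # True
```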
orgs.loc[orgs.organization_name == 'University of California']
| organization_country | organization_city | organization_name | |
|---|---|---|---|
| 73 | United States of America | Berkeley, CA | University of California |
| 205 | United States of America | Irvine, CA | University of California |
| 116 | United States of America | Los Angeles, CA | University of California |
| 124 | United States of America | San Diego, CA | University of California |
| 190 | United States of America | San Francisco, CA | University of California |
| 214 | United States of America | Santa Barbara, CA | University of California |
This introduces a bias into the next analysis: when counting by organization_name alone, the University of California is actually split across 6 different campuses, yet all of their prizes get lumped together under a single name...
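One hypothetical workaround, sketched below on made-up rows (the column names match the dataset; the data and the `org_label` column are illustrative), is to append the city to the organisation name so each campus counts separately:

```python
import pandas as pd

# Appending the city disambiguates multi-campus institutions
# like the University of California before grouping
orgs = pd.DataFrame({'organization_name': ['University of California',
                                           'University of California',
                                           'Harvard University'],
                     'organization_city': ['Berkeley, CA', 'San Diego, CA',
                                           'Cambridge, MA']})
orgs['org_label'] = orgs.organization_name + ' (' + orgs.organization_city + ')'
print(orgs.org_label.nunique())  # 3 distinct labels instead of 2 names
```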
prize_per_organization = df_data.groupby('organization_name').agg({'prize': 'count'})
prize_per_organization.reset_index(inplace=True)
# Using lists to sort by nb of prizes then by alphabet
prize_per_organization.sort_values(['prize', 'organization_name'], ascending=[False, True], inplace=True)
top20_organizations = prize_per_organization.head(20)
top20_organizations
| organization_name | prize | |
|---|---|---|
| 196 | University of California | 40 |
| 68 | Harvard University | 29 |
| 167 | Stanford University | 23 |
| 117 | Massachusetts Institute of Technology (MIT) | 21 |
| 198 | University of Chicago | 20 |
| 197 | University of Cambridge | 18 |
| 26 | California Institute of Technology (Caltech) | 17 |
| 38 | Columbia University | 17 |
| 146 | Princeton University | 15 |
| 119 | Max-Planck-Institut | 13 |
| 152 | Rockefeller University | 13 |
| 222 | University of Oxford | 12 |
| 111 | MRC Laboratory of Molecular Biology | 10 |
| 258 | Yale University | 9 |
| 12 | Bell Laboratories | 8 |
| 40 | Cornell University | 8 |
| 67 | Harvard Medical School | 7 |
| 79 | Institut Pasteur | 7 |
| 109 | London University | 7 |
| 163 | Sorbonne University | 7 |
fig9 = px.bar(top20_organizations, x='prize', y='organization_name', color='prize',
orientation='h', height=500, title='Top 20 Research Institutions by Number of Nobel Prizes',
color_continuous_scale='Viridis')
fig9.update_layout(xaxis_title='Number of Prizes', yaxis_title='Institution',
coloraxis_showscale=False)
fig9.update_yaxes(categoryorder='total ascending')
fig9.show()
Each research organisation is located in a particular city. Are some cities hot spots for scientific discoveries?
Where do major discoveries take place?
Challenge:
prize_per_city = df_data.groupby('organization_city', as_index=False).agg({'prize': 'count'}).sort_values('prize', ascending=False)
top20_cities = prize_per_city.head(20)
top20_cities
| organization_city | prize | |
|---|---|---|
| 34 | Cambridge, MA | 50 |
| 121 | New York, NY | 45 |
| 33 | Cambridge | 31 |
| 92 | London | 27 |
| 128 | Paris | 25 |
| 156 | Stanford, CA | 24 |
| 12 | Berkeley, CA | 21 |
| 39 | Chicago, IL | 20 |
| 135 | Princeton, NJ | 19 |
| 129 | Pasadena, CA | 17 |
| 126 | Oxford | 12 |
| 13 | Berlin | 12 |
| 157 | Stockholm | 10 |
| 108 | Moscow | 10 |
| 71 | Heidelberg | 9 |
| 110 | Munich | 9 |
| 23 | Boston, MA | 9 |
| 119 | New Haven, CT | 9 |
| 78 | Ithaca, NY | 8 |
| 65 | Göttingen | 8 |
fig10 = px.bar(top20_cities, x='prize', y='organization_city', color='prize',
orientation='h', height=500, title='Top 20 Cities by Number of Nobel Prizes',
color_continuous_scale='Plasma',
labels={'organization_city': 'City', 'prize': 'Number of Prizes'})
fig10.update_layout(coloraxis_showscale=False)
fig10.update_yaxes(categoryorder='total ascending')
fig10.show()
Challenge:
Use the Plasma color scale for the chart.

prize_per_birth_city = df_data.groupby('birth_city', as_index=False).agg({'prize': 'count'}).sort_values('prize', ascending=False)
top20_birth_cities = prize_per_birth_city.head(20)
fig11 = px.bar(top20_birth_cities, x='prize', y='birth_city', color='prize',
orientation='h', height=500, title='Top 20 Birth Cities by Number of Nobel Prizes',
color_continuous_scale='Plasma',
labels={'birth_city': 'City', 'prize': 'Number of Prizes'})
fig11.update_layout(coloraxis_showscale=False)
fig11.update_yaxes(categoryorder='total ascending')
fig11.show()
A higher population means a higher chance of a Nobel laureate being born there. New York, Paris, and London are all very populous. However, Vienna and Budapest are not, and they still produced many prize winners. That said, much of the ground-breaking research does not take place in big population centres, so the list of birth cities is quite different from the list above. Cambridge (Massachusetts), Stanford, Berkeley, and Cambridge (UK) are all places where many discoveries are made, but they are not the birthplaces of many laureates.
total_prizes_birth_USA = len(df_data.loc[df_data.birth_country_current == 'United States of America'])
total_prizes_birth_NYC = len(df_data.loc[df_data.birth_city == 'New York, NY'])
pct_NYC = total_prizes_birth_NYC/total_prizes_birth_USA*100
print(f"{pct_NYC:.0f}% of the United States prizes came from Nobel laureates born in New York.")
19% of the United States prizes came from Nobel laureates born in New York.
Challenge:
Here's what you're aiming for: link
Note: The sunburst graph is interactive! We can click on the elements.
prize_orgs = df_data.groupby(['organization_country', 'organization_city', 'organization_name'],
as_index=False).agg({'prize': 'count'}).rename(columns={'prize': 'nb_of_nobel_prizes'})
fig12 = px.sunburst(prize_orgs, path=['organization_country', 'organization_city', 'organization_name'],
values='nb_of_nobel_prizes', height=1200,
title='Nobel Prizes distributed by Research Institutions')
fig12.show()
df_data.insert(loc=8, column='winning_age', value=df_data.year - df_data.birth_date.dt.year)
df_data.iloc[:5,:10] # df.iloc[row_start:row_end, column_start:column_end]
| year | category | prize | motivation | prize_share | share_pct | laureate_type | full_name | winning_age | birth_date | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1901 | Chemistry | The Nobel Prize in Chemistry 1901 | "in recognition of the extraordinary services ... | 1/1 | 1.00 | Individual | Jacobus Henricus van 't Hoff | 49.00 | 1852-08-30 |
| 1 | 1901 | Literature | The Nobel Prize in Literature 1901 | "in special recognition of his poetic composit... | 1/1 | 1.00 | Individual | Sully Prudhomme | 62.00 | 1839-03-16 |
| 2 | 1901 | Medicine | The Nobel Prize in Physiology or Medicine 1901 | "for his work on serum therapy, especially its... | 1/1 | 1.00 | Individual | Emil Adolf von Behring | 47.00 | 1854-03-15 |
| 3 | 1901 | Peace | The Nobel Peace Prize 1901 | NaN | 1/2 | 0.50 | Individual | Frédéric Passy | 79.00 | 1822-05-20 |
| 4 | 1901 | Peace | The Nobel Peace Prize 1901 | NaN | 1/2 | 0.50 | Individual | Jean Henry Dunant | 73.00 | 1828-05-08 |
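Note that the winning_age calculation relies on the .dt accessor, which only works once birth_date has been parsed with pd.to_datetime(); a minimal sketch on a made-up row:

```python
import pandas as pd

# birth_date must be a datetime column before .dt.year works
df = pd.DataFrame({'year': [1901], 'birth_date': ['1852-08-30']})
df.birth_date = pd.to_datetime(df.birth_date)
df['winning_age'] = df.year - df.birth_date.dt.year
print(df.winning_age[0])  # 49
```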
Challenge:
df_data.loc[df_data.winning_age == df_data.winning_age.min()].iloc[:, :14]
| year | category | prize | motivation | prize_share | share_pct | laureate_type | full_name | winning_age | birth_date | birth_city | birth_country | birth_country_current | sex | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 885 | 2014 | Peace | The Nobel Peace Prize 2014 | "for their struggle against the suppression of... | 1/2 | 0.50 | Individual | Malala Yousafzai | 17.00 | 1997-07-12 | Mingora | Pakistan | Pakistan | Female |
The youngest Nobel laureate is Malala Yousafzai at 17 years old.
df_data.loc[df_data.winning_age == df_data.winning_age.max()].iloc[:, :14]
| year | category | prize | motivation | prize_share | share_pct | laureate_type | full_name | winning_age | birth_date | birth_city | birth_country | birth_country_current | sex | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 937 | 2019 | Chemistry | The Nobel Prize in Chemistry 2019 | “for the development of lithium-ion batteries” | 1/3 | 0.33 | Individual | John Goodenough | 97.00 | 1922-07-25 | Jena | Germany | Germany | Male |
The oldest Nobel laureate is John Goodenough at 97 years old.
print(f"The average age of a Nobel prize winner is {df_data.winning_age.mean():.0f} years old.")
The average age of a Nobel prize winner is 60 years old.
Experiment with the bin size: try 10, 20, 30, and 50.

df_data.winning_age.to_frame().describe()
| winning_age | |
|---|---|
| count | 934.00 |
| mean | 59.95 |
| std | 12.62 |
| min | 17.00 |
| 25% | 51.00 |
| 50% | 60.00 |
| 75% | 69.00 |
| max | 97.00 |
75% of the laureates are less than 69 years old.
prize_per_age = df_data.winning_age.to_frame().dropna().sort_values('winning_age').reset_index(drop=True)
prize_per_age.winning_age = prize_per_age.winning_age.astype('int64')
prize_per_age
| winning_age | |
|---|---|
| 0 | 17 |
| 1 | 25 |
| 2 | 25 |
| 3 | 31 |
| 4 | 31 |
| ... | ... |
| 929 | 89 |
| 930 | 89 |
| 931 | 90 |
| 932 | 96 |
| 933 | 97 |
934 rows × 1 columns
np.random.randint(0, 5, 10)
array([3, 3, 1, 1, 2, 2, 1, 0, 0, 0])
hist_x_ticks = np.arange(10, 101, 10)
hist_x_ticks
array([ 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
plt.figure(figsize=(8,4), dpi=200)
# using ticks style (so xticks appear) and overriding background color
with sns.axes_style('ticks', rc={'axes.facecolor': '#EAEAF2'}):
hist = sns.histplot(x=prize_per_age.winning_age, bins=30)
hist.set(xlabel='Age of Nobel Prize Laureates')
hist.grid(color='white', linestyle='--', axis='y')
# plt.xticks(ticks=hist_x_ticks)
plt.show()
Options for the histogram include:
- bins=20
- discrete=True
- plt.xticks(ticks=np.arange(10, 101, 10))

Tips:
- Use display() to show several DataFrames from the same cell.
- Use .nlargest() and .nsmallest() to fetch the extreme rows:

display(df_data.nlargest(n=1, columns='winning_age'))
display(df_data.nsmallest(n=1, columns='winning_age'))
Are Nobel laureates being nominated later in life than before? Have the ages of laureates at the time of the award increased or decreased over time?
Challenge
Set the lowess parameter to True to show a moving average of the linear fit.

age_per_year = df_data[['year', 'winning_age']].dropna().reset_index(drop=True)
age_per_year
| year | winning_age | |
|---|---|---|
| 0 | 1901 | 49.00 |
| 1 | 1901 | 62.00 |
| 2 | 1901 | 47.00 |
| 3 | 1901 | 79.00 |
| 4 | 1901 | 73.00 |
| ... | ... | ... |
| 929 | 2020 | 85.00 |
| 930 | 2020 | 71.00 |
| 931 | 2020 | 55.00 |
| 932 | 2020 | 68.00 |
| 933 | 2020 | 89.00 |
934 rows × 2 columns
Instead of a straight regression line, using the lowess parameter gives us a moving linear fit ("moving average of the linear fit").
Alternatively, we can also set the order of the polynomial fit to a higher degree, e.g. order=2
plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style('darkgrid'):
# N.B. with regplot, the styles of the line and of the scatter must be set separately
# in these two parameters: scatter_kws and line_kws
reg = sns.regplot(x=age_per_year.year, y=age_per_year.winning_age,
lowess=True, # must install the 'statsmodels' package to use this parameter
# order=2,
scatter_kws={'s':8, 'alpha': 0.4, 'color': '#2f4b7c'}, # s for the size of the dots
line_kws={'color': 'brown', 'linewidth': 2})
reg.set(xlabel='Year of Nobel Prize Award', ylabel='Age of Prize Laureate')
plt.show()
Using the lowess parameter allows us to plot a local linear regression. This means the best fit line is still linear, but it's more like a moving average which gives us a non-linear shape across the entire series. This is super neat because it clearly shows how the Nobel laureates are getting their award later and later in life. From 1900 to around 1950, the laureates were around 55 years old, but these days they are closer to 70 years old when they get their award! The other thing that we see in the chart is that in the last 10 years the spread has increased. We've had more very young and very old winners. In 1950s/60s winners were between 30 and 80 years old. Lately, that range has widened.
How does the age of laureates vary by category?
Use Seaborn's .boxplot() to show how the median, quartiles, maximum, and minimum values vary across categories. Which category has the longest "whiskers"?

age_by_category = df_data[['category', 'winning_age']].dropna().reset_index(drop=True)\
.rename(columns={'category': 'Nobel Prize Category', 'winning_age': 'Age of Prize Laureate'})
age_by_category
| Nobel Prize Category | Age of Prize Laureate | |
|---|---|---|
| 0 | Chemistry | 49.00 |
| 1 | Literature | 62.00 |
| 2 | Medicine | 47.00 |
| 3 | Peace | 79.00 |
| 4 | Peace | 73.00 |
| ... | ... | ... |
| 929 | Medicine | 85.00 |
| 930 | Medicine | 71.00 |
| 931 | Physics | 55.00 |
| 932 | Physics | 68.00 |
| 933 | Physics | 89.00 |
934 rows × 2 columns
plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style('darkgrid'):
box = sns.boxplot(data=age_by_category,
x='Nobel Prize Category', y='Age of Prize Laureate',
# hue='Nobel Prize Category' # no need for it, colors already applied
)
# plt.legend([],[], frameon=False) # to remove the legend, if there is one
plt.show()
The box plot shows us the median, the quartiles, and the maximum and minimum values (plus outliers). It raises an interesting question: "Are Peace Prize winners really older than physics laureates?".
Note: can also do the box plot with Plotly Express px
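To answer that question numerically rather than visually, one could compare the group medians. A sketch on toy ages (the real comparison would use df_data's winning_age column):

```python
import pandas as pd

# Compare the median winning age per category on made-up data
ages = pd.DataFrame({'category': ['Peace', 'Peace', 'Physics', 'Physics'],
                     'winning_age': [79, 73, 49, 55]})
medians = ages.groupby('category').winning_age.median()
print(medians['Peace'] > medians['Physics'])  # True for this toy sample
```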
Challenge
.lmplot() combines regplot() and FacetGrid. It is intended as a convenient interface to fit regression models across conditional subsets of a dataset.
Use the row parameter to create 6 separate charts, one per prize category. Again, set lowess to True. Is the .lmplot() telling a different story from the .boxplot()? Then tweak the .lmplot() to put all 6 categories on the same chart using the hue parameter.

age_per_year_and_category = df_data[['year', 'category', 'winning_age']].dropna().reset_index(drop=True)\
.rename(columns={'year': 'Year', 'category': 'Category', 'winning_age': 'Age of Prize Laureate'})
age_per_year_and_category
| Year | Category | Age of Prize Laureate | |
|---|---|---|---|
| 0 | 1901 | Chemistry | 49.00 |
| 1 | 1901 | Literature | 62.00 |
| 2 | 1901 | Medicine | 47.00 |
| 3 | 1901 | Peace | 79.00 |
| 4 | 1901 | Peace | 73.00 |
| ... | ... | ... | ... |
| 929 | 2020 | Medicine | 85.00 |
| 930 | 2020 | Medicine | 71.00 |
| 931 | 2020 | Physics | 55.00 |
| 932 | 2020 | Physics | 68.00 |
| 933 | 2020 | Physics | 89.00 |
934 rows × 3 columns
sns.set_style('darkgrid')
# N.B. lmplot creates its own figure, so a preceding plt.figure() call
# would only produce an empty extra figure
sns.lmplot(data=age_per_year_and_category,
           x='Year', y='Age of Prize Laureate',
           hue='Category', lowess=True,
           scatter_kws={'alpha': 0.5},
           line_kws={'linewidth': 3},
           height=8, aspect=1.5)  # height (in inches); aspect * height gives the width
plt.title('Trend of Nobel Prize Laureate Age per Category')
plt.show()
We can put each of the graphs (facets) in an individual row with the row='Category' argument, but that is a bit of a waste of space. Instead, we put them in columns and wrap to a new row after every 2 graphs: col='Category', col_wrap=2.
sns.set_style('darkgrid')
sns.lmplot(data=age_per_year_and_category,
           x='Year', y='Age of Prize Laureate',
           col='Category', hue='Category', lowess=True,
           col_wrap=2,
           scatter_kws={'alpha': 0.5},
           line_kws={'linewidth': 3},
           height=4, aspect=2)  # size of each facet
plt.show()
We see that winners in physics, chemistry, and medicine have gotten older over time. The ageing trend is strongest for physics. The average age used to be below 50, but now it's over 70. Economics, the newest category, is much more stable in comparison. The peace prize shows the opposite trend where winners are getting younger! As such, our scatter plots showing the best fit lines over time and our box plot of the entire dataset can tell very different stories!
In this lesson, we reviewed many concepts that we've covered previously, including:
Create a rolling average to smooth out time-series data and show a trend, e.g. prizes_per_year.rolling(window=5).mean()
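A minimal self-contained sketch of that rolling average on a toy series, using the same window=5:

```python
import pandas as pd

# Rolling mean smooths a series; the first window-1 values are NaN
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
rolling = s.rolling(window=5).mean()
print(rolling.iloc[4])  # 3.0, the mean of the first five values
```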
How to use .value_counts(), .groupby(), .merge(), .sort_values() and .agg(). Note: .value_counts() is sometimes more elegant than .groupby().agg({'column': 'count'})
Summary of all the import statements:
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
Using .transpose() or the .T attribute to get the transposed view of a DataFrame
Using .insert(): df_data.insert(loc=5, column='share_pct', value=numerator/denominator)
Using .drop([labels_list], axis='columns') when dropping columns (and axis="rows" by default otherwise)
Using .duplicated() with parameters, e.g. df_data.duplicated(subset=['full_name'], keep=False)
Using .nunique() to find number of unique values in a Series
Using .reset_index, for example after dropping values: df_cumul_country.dropna().reset_index(drop=True)
Using .dt to access datetime attributes or methods, applied to a Series, e.g. df.birth_year.dt.year
Using .describe() to return "Descriptive Statistics" about a DataFrame
Using astype() to convert a Series to a datatype, e.g. .astype('int64')
Creating plotly donut charts with .pie() and some parameters like hole=0.5, height=500, title=
Creating plotly bar charts with .bar() and parameters like labels={'old': 'new'}, color_continuous_scale='Aggrnyl', color=category, as well as orientation='h' to set the bars horizontal (consider the new x and y params in this case)
Customizing plotly charts with .update_traces() and parameters like textfont_size=15, coloraxis_showscale=False
Changing the order of the bars in plotly with fig3.update_xaxes(categoryorder='total descending')
Creating regression plots with seaborn and .regplot() and changing the style of the scatter plot with scatter_kws={} and of the reg line with line_kws={}, e.g.
scatter_kws={'s':8, 'alpha': 0.4, 'color': '#2f4b7c'}, # s for the size of the dots
line_kws={'color': 'brown', 'linewidth': 2}
.plot() and .scatter(), and using all the parameters and methods such as:

plt.figure(figsize=(16,8), dpi=200) # setting the size and resolution for display
plt.xticks(ticks=x_ticks, fontsize=14, rotation=45) # styling the x-ticks
plt.yticks(fontsize=14) # styling the y-ticks
plt.title("Number of Nobel Prizes per Year", fontsize=18) # give a title
plt.xlabel('Year', fontsize=16) # styling the x-label
plt.ylabel('Number of Prizes', fontsize=16, color='crimson') # styling the y-label
plt.xlim(1900, 2020) # setting the x-axis boundaries
ax1 = plt.gca() # Get Current Axes
ax2 = plt.twinx() # Creating a second y-axis (which shares the same x-axis)
ax1.scatter(prizes_per_year.index, prizes_per_year.prize, c='dodgerblue',
alpha=0.6, s=100) # c for color, alpha transparency, s marker size
ax1.plot(prizes_per_year.index, prizes_per_year_rolling, c='crimson',
linewidth=3) # increasing linewidth
ax1.grid(color='grey', linestyle='--') # adding a grid
# Third positional argument in plot is fmt = '[marker][line][color]'
ax2.plot(prizes_per_year.index, prize_share_rolling.share_pct, '--g', # dashed line green
linewidth=3)
ax2.set_ylim(0, 1)
ax2.set_ylabel('Prize Share', fontsize=16, color='green')
plt.setp(ax2.get_yticklabels(), fontsize=14) # styling the second axis' tick labels
ax2.invert_yaxis()
plt.show()
In addition, we learned many new things too. We looked at how to:
Run shell commands from a notebook cell with the ! or % characters (see saved links in folder "IPython magic commands ! $ % etc - run shell commands"), e.g.

!pip install --upgrade plotly
col_subset = ['year','category', 'laureate_type',
'birth_date','full_name', 'organization_name']
df_data.loc[df_data.birth_date.isna()][col_subset] # we can reuse a subset like this!
The .str.split() method returns a Series where values are lists of two items such as [1, 1] (unless we use the expand=True parameter). Then we use .map() or .apply() with a lambda function to get hold of the first or second element. See StackOverflow post. E.g.

numerator = df_data.prize_share.str.split('/').map(lambda x: x[0]).astype('int64')
denominator = df_data.prize_share.str.split('/').map(lambda x: x[1]).astype('int64')
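The expand=True variant mentioned above splits straight into DataFrame columns, avoiding the .map(lambda ...) step; a sketch on made-up shares:

```python
import pandas as pd

# expand=True turns '1/2'-style strings into two integer columns at once
prize_share = pd.Series(['1/1', '1/2', '1/4'])
parts = prize_share.str.split('/', expand=True).astype('int64')
share_pct = parts[0] / parts[1]
print(share_pct.tolist())  # [1.0, 0.5, 0.25]
```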
Negate .loc[] conditions (in general with pandas and numpy, we use the bitwise NOT ~ instead of ! or not):

df_people = df_data.loc[not df_data.sex.isna()] # raises an error: the truth value of a Series is ambiguous
df_people = df_data.loc[~ df_data.sex.isna()] # OK
Slice with .iloc[] -> df.iloc[row_start:row_end, col_start:col_end], e.g.

df_data.iloc[2:4, :6]
Use .fillna() to replace NaN values with a specific value, e.g. category_and_gender.fillna(value='Organization', inplace=True)
Using NumPy's .arange() method to create a custom array of ticks for our graph. Also, np.random.randint(0, 5, 10) returns 10 random values between 0 (included) and 5 (excluded).
x_ticks = np.arange(1900, 2021, 5)
plt.xticks(ticks=x_ticks, fontsize=14, rotation=45)
Use .isin() to check which values of a Series appear in another Series, e.g.

df_data.loc[df_data.birth_country_current.isin(top20_countries.birth_country_current)]
Use the .cumcount() method wisely: it only returns a Series, and the count starts at 0, e.g.

cumul_prize_country_series = df_data_sorted.groupby('birth_country_current').cumcount()
df_cumul_country['Cumulated Number of Nobel Prizes'] = df_cumul_country['Cumulated Number of Nobel Prizes'] + 1
.copy() method when/if getting a warning about trying to create a column, e.g. df_cumul_country = df_data_sorted[['year', 'birth_country_current']].copy()
Use .count() and then .cumsum(), plus the ability to group by an index level, as an alternative to .cumcount(), e.g.

prize_by_year = df_data.groupby(by=['birth_country_current', 'year']).agg({'prize': 'count'})
cumulative_prizes = prize_by_year.groupby(level=[0]).cumsum()
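A toy check (made-up data, same column names) that the two cumulative-count approaches agree:

```python
import pandas as pd

# Approach 1: .cumcount() starts at 0, so add 1
# Approach 2: count per (country, year), then cumsum within each country
df = pd.DataFrame({'birth_country_current': ['US', 'FR', 'US', 'US'],
                   'year': [1901, 1902, 1903, 1904],
                   'prize': ['a', 'b', 'c', 'd']})
df = df.sort_values(['birth_country_current', 'year'])
cum1 = df.groupby('birth_country_current').cumcount() + 1
by_year = df.groupby(['birth_country_current', 'year']).agg({'prize': 'count'})
cum2 = by_year.groupby(level=0).cumsum()
print(cum1.tolist(), cum2.prize.tolist())  # same running totals
```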
Use float formatting in an f-string, e.g. print(f"blabla the age is {df_data.winning_age.mean():.0f} years old.")
Force-display several elements in the same block of code, without using print, with display()
Fetch extreme rows with .nlargest() and .nsmallest(), e.g.

display(df_data.nlargest(n=1, columns='winning_age'))
display(df_data.nsmallest(n=1, columns='winning_age'))
Create a Choropleth to display data on a map with plotly px.choropleth() and parameters like locations='ISO' and color_continuous_scale='matter'
Use plotly to display a whole bunch of lines at once (without doing a for loop like with matplotlib) with .line() and the color= parameter
Create Sunburst charts with plotly, e.g.
fig12 = px.sunburst(prize_orgs, path=['organization_country', 'organization_city', 'organization_name'],
values='nb_of_nobel_prizes', height=1200,
title='Nobel Prizes distributed by Research Institutions')
fig12.show()
Create histograms with Seaborn's .histplot() method, give a pre-defined style and override some of its values with rc={}, and change the number of bars/bins with bins=, and more:

plt.figure(figsize=(8,4), dpi=200)
# using ticks style (so xticks appear) and overriding background color
with sns.axes_style('ticks', rc={'axes.facecolor': '#EAEAF2'}):
hist = sns.histplot(x=prize_per_age.winning_age, bins=30)
hist.set(xlabel='Age of Nobel Prize Laureates')
hist.grid(color='white', linestyle='--', axis='y')
# plt.xticks(ticks=hist_x_ticks)
plt.show()
Create a regression plot with Seaborn with a moving average line using the lowess=True parameter (statsmodels package must be installed though).
Create box plots with Seaborn and sns.boxplot()
Remove the legend in Seaborn or matplotlib plots with plt.legend([],[], frameon=False)
Use Seaborn's .lmplot() and show scatter and regression/best-fit lines across multiple categories using the hue, height, aspect and lowess parameters, either in one window or in multiple with col, col_wrap and row.